2025-05-19-20-06
Interpretable Risk Mitigation in LLM Agent Systems
Abstract
arXiv:2505.10670v1 Announce Type: new Abstract: Autonomous agents powered by large language models (LLMs) enable novel use cases in domains where responsible action is increasingly important. Yet the inherent unpredictability of LLMs raises safety concerns about agent reliability. In this work, we explore agent behaviour in a toy, game-theoretic environment based on a variation of the Iterated Prisoner's Dilemma. We introduce a strategy-modification method, independent of both the game and the prompt, that steers the residual stream with interpretable features extracted from a sparse autoencoder latent space. Steering with the good-faith negotiation feature lowers the average defection probability by 28 percentage points. We also identify feasible steering ranges for several open-source LLM agents. Finally, we hypothesise that game-theoretic evaluation of LLM agents, combined with representation-steering alignment, can generalise to real-world applications on end-user devices and embodied platforms.
摘要
由大语言模型(LLMs)驱动的自主智能体在责任行为日益重要的领域展现出新颖的应用前景。然而LLMs固有的不可预测性引发了关于智能体可靠性的安全担忧。本研究基于迭代囚徒困境的变体,在一个玩具博弈论环境中探索智能体行为。我们提出了一种独立于游戏规则和提示词的策略修改方法——通过利用稀疏自编码器潜在空间中提取的可解释特征来引导残差流。实验表明,采用诚信协商特征进行引导时,平均背叛概率降低了28个百分点。同时,我们确定了多个开源LLM智能体的可行引导范围。最后,我们提出假设:结合表征引导对齐的博弈论评估方法,可推广至终端用户设备和实体化平台的实际应用场景。
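The steering step described in this abstract can be pictured as adding a scaled feature direction to a residual-stream vector. The following is only a minimal sketch under assumptions not stated in the abstract: the feature direction is taken to be a single vector (for example a sparse autoencoder decoder row), it is unit-normalised, and the update is purely additive. The function name `steer` and the numbers are hypothetical.

```python
from typing import List


def steer(residual: List[float], feature_direction: List[float], strength: float) -> List[float]:
    """Add a unit-normalised feature direction, scaled by `strength`, to one residual vector.

    Assumption: additive steering with a normalised direction; the paper's exact
    scheme is not given in the abstract.
    """
    norm = sum(x * x for x in feature_direction) ** 0.5
    if norm == 0.0:
        raise ValueError("feature direction must be non-zero")
    return [h + strength * d / norm for h, d in zip(residual, feature_direction)]


# Hypothetical 3-dimensional residual vector and feature direction (norm 5).
hidden = [0.5, -1.0, 2.0]
direction = [3.0, 0.0, 4.0]
steered = steer(hidden, direction, strength=2.0)
print(steered)
```

Under this reading, a "feasible steering range" would be the interval of `strength` values for which the agent's behaviour shifts while its outputs stay coherent; that interval has to be found empirically per model.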
Evaluations at Work: Measuring the Capabilities of GenAI in Use
Abstract
arXiv:2505.10742v1 Announce Type: new Abstract: Current AI benchmarks miss the messy, multi-turn nature of human-AI collaboration. We present an evaluation framework that decomposes real-world tasks into interdependent subtasks, letting us track both LLM performance and users' strategies across a dialogue. Complementing this framework, we develop a suite of metrics, including a composite usage metric derived from semantic similarity, word overlap, and numerical matches; structural coherence; intra-turn diversity; and a novel measure of the "information frontier" reflecting the alignment between AI outputs and users' working knowledge. We demonstrate our methodology in a financial valuation task that mirrors real-world complexity. Our empirical findings reveal that while greater integration of LLM-generated content generally enhances output quality, its benefits are moderated by factors such as response incoherence, excessive subtask diversity, and the distance of provided information from users' existing knowledge. These results suggest that proactive dialogue strategies designed to inject novelty may inadvertently undermine task performance. Our work thus advances a more holistic evaluation of human-AI collaboration, offering both a robust methodological framework and actionable insights for developing more effective AI-augmented work processes.
摘要
当前的人工智能基准测试未能捕捉人机协作中混乱、多轮交互的本质。我们提出一个评估框架,将现实任务分解为相互依赖的子任务,从而能够追踪对话过程中大语言模型的性能表现与用户策略。作为该框架的补充,我们开发了一套评估指标,包括基于语义相似度、词汇重叠率和数值匹配的综合使用度、结构连贯性、轮内多样性,以及反映AI输出与用户既有知识对齐程度的创新性"信息前沿"指标。我们通过模拟真实复杂度的金融估值任务验证了这一方法论。实证研究表明:虽然更深度整合大语言模型生成内容通常能提升输出质量,但其效益会受到响应不连贯、子任务多样性过高、所提供信息与用户既有知识距离过远等因素的调节。这些发现表明,旨在注入新颖性的主动对话策略可能无意中损害任务表现。本研究由此推进了对人机协作更全面的评估,既提供了严谨的方法论框架,也为开发更有效的人工智能增强工作流程给出了可操作的见解。
Embodied AI in Machine Learning -- is it Really Embodied?
Abstract
arXiv:2505.10705v1 Announce Type: new Abstract: Embodied Artificial Intelligence (Embodied AI) is gaining momentum in the machine learning communities with the goal of leveraging current progress in AI (deep learning, transformers, large language and visual-language models) to empower robots. In this chapter we put this work in the context of "Good Old-Fashioned Artificial Intelligence" (GOFAI) (Haugeland, 1989) and the behavior-based or embodied alternatives (R. A. Brooks 1991; Pfeifer and Scheier 2001). We claim that the AI-powered robots are only weakly embodied and inherit some of the problems of GOFAI. Moreover, we review and critically discuss the possibility of cross-embodiment learning (Padalkar et al. 2024). We identify fundamental roadblocks and propose directions on how to make progress.
摘要
具身人工智能(Embodied AI)正在机器学习领域获得持续关注,其目标是通过利用当前人工智能领域(深度学习、Transformer架构、大语言模型及视觉-语言模型)的进展来增强机器人能力。本章将这项工作置于"经典人工智能"(GOFAI)(Haugeland, 1989)与基于行为或具身的替代方案(R. A. Brooks 1991; Pfeifer和Scheier 2001)的理论框架中进行探讨。我们认为当前AI驱动的机器人仅具备弱具身性,并继承了经典人工智能的某些固有问题。此外,我们系统评述并批判性讨论了跨具身学习(Padalkar等, 2024)的可能性。研究揭示了根本性障碍,并就突破方向提出了建议。
PoE-World: Compositional World Modeling with Products of Programmatic Experts
Abstract
arXiv:2505.10819v1 Announce Type: new Abstract: Learning how the world works is central to building AI agents that can adapt to complex environments. Traditional world models based on deep learning demand vast amounts of training data, and do not flexibly update their knowledge from sparse observations. Recent advances in program synthesis using Large Language Models (LLMs) give an alternate approach which learns world models represented as source code, supporting strong generalization from little data. To date, application of program-structured world models remains limited to natural language and grid-world domains. We introduce a novel program synthesis method for effectively modeling complex, non-gridworld domains by representing a world model as an exponentially-weighted product of programmatic experts (PoE-World) synthesized by LLMs. We show that this approach can learn complex, stochastic world models from just a few observations. We evaluate the learned world models by embedding them in a model-based planning agent, demonstrating efficient performance and generalization to unseen levels on Atari's Pong and Montezuma's Revenge. We release our code and display the learned world models and videos of the agent's gameplay at https://topwasu.github.io/poe-world.
摘要
学习世界运作机制是构建能够适应复杂环境的人工智能代理的核心任务。传统基于深度学习的世界模型需要大量训练数据,且无法通过稀疏观察灵活更新知识。近期利用大型语言模型(LLM)进行程序合成的研究进展提供了一种替代方案,该方法可学习以源代码表示的世界模型,实现少量数据下的强泛化能力。目前,程序结构化世界模型的应用仍局限于自然语言和网格世界领域。我们提出一种新颖的程序合成方法,通过将世界模型表示为LLM合成的程序专家指数加权乘积(PoE-World),有效建模复杂的非网格世界领域。研究表明,该方法仅需少量观察即可学习复杂的随机世界模型。我们通过将习得的世界模型嵌入基于模型的规划代理进行评估,在Atari的《Pong》和《蒙特祖马的复仇》游戏中展现出高效性能及对未见过关卡的泛化能力。代码已开源,学习到的世界模型及代理游戏视频详见https://topwasu.github.io/poe-world。
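To make the "exponentially-weighted product of programmatic experts" concrete, here is a minimal sketch over a discrete set of next-state outcomes. It assumes the combined distribution is proportional to the product of each expert's probability raised to that expert's weight; the abstract does not spell out the exact form, and the expert names, outcomes, and weights below are hypothetical.

```python
import math
from typing import Dict, List


def product_of_experts(experts: List[Dict[str, float]], weights: List[float]) -> Dict[str, float]:
    """Combine discrete distributions as p(x) proportional to prod_i p_i(x) ** w_i."""
    if not experts or len(experts) != len(weights):
        raise ValueError("need one weight per expert and at least one expert")
    outcomes = set().union(*experts)
    log_scores = {}
    for outcome in outcomes:
        score = 0.0
        for dist, weight in zip(experts, weights):
            p = dist.get(outcome, 0.0)
            if p <= 0.0:
                score = float("-inf")  # any expert can veto an outcome
                break
            score += weight * math.log(p)
        log_scores[outcome] = score
    best = max(log_scores.values())
    if best == float("-inf"):
        raise ValueError("experts assign zero probability to every outcome")
    # Subtract the maximum before exponentiating for numerical stability.
    unnormalised = {o: math.exp(s - best) for o, s in log_scores.items()}
    total = sum(unnormalised.values())
    return {o: v / total for o, v in unnormalised.items()}


# Two hypothetical experts predicting what an object does on the next frame.
gravity_expert = {"falls": 0.5, "stays": 0.5}
platform_expert = {"falls": 0.1, "stays": 0.9}
combined = product_of_experts([gravity_expert, platform_expert], weights=[1.0, 2.0])
print(combined)
```

Working in log space keeps the product stable when many experts are combined, and a weight of zero switches an expert off entirely, so reweighting is one plausible way such a model could be updated from a few observations.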
Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models
Abstract
arXiv:2505.10844v1 Announce Type: new Abstract: Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.
摘要
准确度仍是评估人工智能系统的标准指标,但其对模型求解过程的揭示有限。本研究提出一种基于叙事式长文本谜题的基准测试,旨在深入探究模型采用的推理策略类型。谜题特别适合此目标,因其可通过多种方法求解——既可利用创造性洞察实现简短解答,亦可采用更耗时的暴力求解法。我们通过多层次推理研究大型语言模型(LLMs),不仅关注答案正确性,更着重分析解决方案的质量与创造性。研究涵盖推理过程的多个维度:(1) 将谜题语义解析为精确的数学竞赛格式;(2) 基于数学形式生成解决方案;(3) 根据标准答案自我修正解;(4) 生成分步解决方案框架;(5) 利用提示信息。研究发现LLMs在多案例中能提出具有创造性和洞察力的解法,表明其已具备以创新方式解决新问题的部分能力。然而,当存在更高效创新解法时,模型仍存在依赖暴力求解的情况,这揭示了LLMs推理能力有待改进的方向。
Vaiage: A Multi-Agent Solution to Personalized Travel Planning
Abstract
arXiv:2505.10922v1 Announce Type: new Abstract: Planning trips is a cognitively intensive task involving conflicting user preferences, dynamic external information, and multi-step temporal-spatial optimization. Traditional platforms often fall short: they provide static results, lack contextual adaptation, and fail to support real-time interaction or intent refinement. Our approach, Vaiage, addresses these challenges through a graph-structured multi-agent framework built around large language models (LLMs) that serve as both goal-conditioned recommenders and sequential planners. LLMs infer user intent, suggest personalized destinations and activities, and synthesize itineraries that align with contextual constraints such as budget, timing, group size, and weather. Through natural language interaction, structured tool use, and map-based feedback loops, Vaiage enables adaptive, explainable, and end-to-end travel planning grounded in both symbolic reasoning and conversational understanding. To evaluate Vaiage, we conducted human-in-the-loop experiments using rubric-based GPT-4 assessments and qualitative feedback. The full system achieved an average score of 8.5 out of 10, outperforming the no-strategy (7.2) and no-external-API (6.8) variants, particularly in feasibility. Qualitative analysis indicated that agent coordination, especially involving the Strategy and Information Agents, significantly improved itinerary quality by optimizing time use and integrating real-time context. These results demonstrate the effectiveness of combining LLM reasoning with symbolic agent coordination in open-ended, real-world planning tasks.
摘要
规划旅行是一项认知密集型任务,涉及用户偏好的冲突、动态外部信息以及多步骤时空优化。传统平台通常存在不足——它们提供静态结果、缺乏情境适应性,且不支持实时交互或意图细化。我们的解决方案Vaiage通过基于大语言模型(LLMs)构建的图结构多智能体框架应对这些挑战,该框架兼具目标条件推荐器和序列规划器的功能。大语言模型能够推断用户意图,推荐个性化目的地和活动,并综合生成符合预算、时间安排、团队规模和天气等情境约束的行程方案。通过自然语言交互、结构化工具使用和基于地图的反馈循环,Vaiage实现了植根于符号推理与会话理解的自适应、可解释、端到端旅行规划。为评估Vaiage,我们采用基于量规的GPT-4评估和定性反馈进行了人在环实验。完整系统平均得分为8.5分(满分10分),优于无策略版本(7.2分)和无外部API版本(6.8分),尤其在可行性方面表现突出。定性分析表明,智能体协调——特别是策略智能体与信息智能体——通过优化时间利用和整合实时情境,显著提升了行程质量。这些结果验证了在开放式现实世界规划任务中,将大语言模型推理与符号化智能体协调相结合的有效性。
Code-Driven Planning in Grid Worlds with Large Language Models
Abstract
arXiv:2505.10749v1 Announce Type: new Abstract: We propose an iterative programmatic planning (IPP) framework for solving grid-based tasks by synthesizing interpretable agent policies expressed in code using large language models (LLMs). Instead of relying on traditional search or reinforcement learning, our approach uses code generation as policy synthesis, where the LLM outputs executable programs that map environment states to action sequences. Our proposed architecture incorporates several prompting strategies, including direct code generation, pseudocode-conditioned refinement, and curriculum-based prompting, and also includes an iterative refinement mechanism that updates code based on task performance feedback. We evaluate our approach using six leading LLMs and two challenging grid-based benchmarks (GRASP and MiniGrid). Our IPP framework demonstrates improvements over direct code generation ranging from 10% to as much as 10x across five of the six models and establishes a new state-of-the-art result for GRASP. IPP is found to significantly outperform direct elicitation of a solution from GPT-o3-mini (by 63% on MiniGrid to 116% on GRASP), demonstrating the viability of the overall approach. Computational costs of all code generation approaches are similar. While code generation has a higher initial prompting cost compared to direct solution elicitation ($0.08 per task vs. $0.002 per instance for GPT-o3-mini), the code can be reused for any number of instances, making the amortized cost significantly lower (by 400x on GPT-o3-mini across the complete GRASP benchmark).
摘要
我们提出了一种迭代式程序化规划(IPP)框架,通过使用大型语言模型(LLMs)合成以代码形式表达的可解释智能体策略,来解决基于网格的任务。与传统搜索或强化学习方法不同,我们的方法将代码生成作为策略合成手段,由LLM输出可执行程序,将环境状态映射为动作序列。该架构整合了多种提示策略,包括直接代码生成、伪代码条件细化以及基于课程学习的提示方法,并引入了迭代优化机制,可根据任务性能反馈更新代码。我们在六个主流LLM模型和两个具有挑战性的网格基准测试(GRASP和MiniGrid)上评估了该方法。实验表明,IPP框架在六个模型中的五个上实现了10%至10倍的性能提升,并在GRASP测试中创造了新的最优结果。相较于GPT-o3-mini直接生成解决方案的方式,IPP表现出显著优势(MiniGrid提升63%,GRASP提升116%),验证了整体方法的可行性。所有代码生成方法的计算成本相近。虽然代码生成的初始提示成本高于直接解决方案生成(GPT-o3-mini每个任务0.08美元 vs 每个实例0.002美元),但生成的代码可无限次复用,使得平摊成本显著降低(在完整GRASP基准测试中GPT-o3-mini成本降低400倍)。
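The amortization argument in this abstract can be checked with simple arithmetic. The sketch below uses only the two prices quoted above and assumes a deliberately simple cost model: one program is generated once per task and reused, at no further cost, for every instance of that task. The function names are hypothetical, and the abstract does not state how many instances the GRASP benchmark contains.

```python
CODE_GENERATION_COST = 0.08   # dollars per task, quoted in the abstract
DIRECT_COST = 0.002           # dollars per instance, quoted in the abstract


def break_even_instances(one_time_cost: float, per_instance_cost: float) -> float:
    """Number of instances at which reusing generated code costs the same as direct prompting."""
    return one_time_cost / per_instance_cost


def savings_factor(one_time_cost: float, per_instance_cost: float, instances: int) -> float:
    """How many times cheaper reused code is than prompting directly for every instance."""
    return (per_instance_cost * instances) / one_time_cost


print(break_even_instances(CODE_GENERATION_COST, DIRECT_COST))   # about 40 instances
print(savings_factor(CODE_GENERATION_COST, DIRECT_COST, 1_000))  # about 25x at 1,000 instances
```

Under this model the generated code pays for itself after about 40 instances, and a 400x reduction would correspond to reusing one program across roughly 16,000 instances. That figure is only what the arithmetic implies, not a number reported in the abstract.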
Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
Abstract
arXiv:2505.10981v1 Announce Type: new Abstract: Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs × 8 prompting strategies × 6 benchmarks. Experimental results consistently show that as the number of samples and the computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a method based on probability theory to quickly and accurately predict the scaling performance and select the best strategy under large sample counts without extra resource-intensive inference in practice. It can serve as the test-time scaling law for majority voting. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance.
摘要
近期,大型语言模型(LLM)的测试时计算规模扩展问题引发了广泛关注。然而,对于不同推理提示策略在规模扩展中的表现,现有研究仍较为有限。本文聚焦于一种标准且现实的规模扩展场景——多数投票机制,系统性地开展了6种LLM×8种提示策略×6个基准测试的实验。实验结果一致表明:随着采样次数和计算开销的增加,初始性能优越的复杂提示策略会逐渐被简单的思维链(Chain-of-Thought)策略反超。我们对此现象进行了分析并给出理论证明。此外,基于概率论提出了一种方法,可在无需额外资源密集型推理的情况下,快速准确地预测规模扩展性能,并选择大采样次数下的最优策略。该方法可作为多数投票机制下的测试时规模扩展定律。进一步地,我们根据理论分析提出两种显著提升规模扩展性能的优化方案。本研究有望推动学界重新审视复杂提示策略的作用,释放简单提示策略的潜力,并为提升测试时规模扩展性能提供新思路。
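The crossover described in this abstract, where a strategy with better single-sample accuracy falls behind under majority voting, can be reproduced with a toy calculation. This sketch is not the paper's prediction method. It assumes each answer is simply right or wrong (all wrong answers coincide), an odd number of samples, and two made-up strategies: a "simple" one that is right 60% of the time on every question, and a "complex" one that is always right on 70% of questions and always wrong on the rest.

```python
from math import comb
from typing import List


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that the correct answer wins a strict majority of n independent samples."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n // 2 + 1, n + 1))


def expected_accuracy(per_question_p: List[float], n: int) -> float:
    """Average majority-vote accuracy over questions with different per-sample accuracies."""
    return sum(majority_vote_accuracy(p, n) for p in per_question_p) / len(per_question_p)


# Hypothetical per-question accuracies over ten questions.
simple_strategy = [0.6] * 10                  # mean single-sample accuracy 0.6
complex_strategy = [1.0] * 7 + [0.0] * 3      # mean single-sample accuracy 0.7

for n in (1, 5, 21, 101):
    print(n, round(expected_accuracy(simple_strategy, n), 3),
          round(expected_accuracy(complex_strategy, n), 3))
```

With one sample the complex strategy leads; with many samples the simple strategy approaches 1.0 while the complex one stays at 0.7, because voting only helps on questions where the correct answer is already the most likely one.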
LLM-Enhanced Symbolic Control for Safety-Critical Applications
Abstract
arXiv:2505.11077v1 Announce Type: new Abstract: Motivated by Smart Manufacturing and Industry 4.0, we introduce a framework for synthesizing Abstraction-Based Controller Design (ABCD) for reach-avoid problems from Natural Language (NL) specifications using Large Language Models (LLMs). A Code Agent interprets an NL description of the control problem and translates it into a formal language interpretable by state-of-the-art symbolic control software, while a Checker Agent verifies the correctness of the generated code and enhances safety by identifying specification mismatches. Evaluations show that the system handles linguistic variability and improves robustness over direct planning with LLMs. The proposed approach lowers the barrier to formal control synthesis by enabling intuitive, NL-based task definition while maintaining safety guarantees through automated validation.
摘要
受智能制造与工业4.0的驱动,本研究提出了一种基于大语言模型的自然语言规范框架,用于解决可达-规避问题的抽象化控制器设计综合。该系统通过代码代理器解析控制问题的自然语言描述,并将其转换为可被前沿符号控制软件识别的形式化语言;同时校验代理器负责验证生成代码的正确性,并通过识别规范失配来增强安全性。评估表明,该系统能有效处理语言变异性,相较于直接使用大语言模型进行规划具有更强的鲁棒性。所提出的方法通过支持直观的自然语言任务定义降低了形式化控制综合的门槛,同时通过自动化验证机制保持了安全保证。
RAGSynth: Synthetic Data for Robust and Faithful RAG Component Optimization
Abstract
arXiv:2505.10989v1 Announce Type: new Abstract: RAG can enhance the performance of LLMs on knowledge-intensive tasks. Various RAG paradigms, including vanilla, planning-based, and iterative RAG, are built on two core components: the retriever, which should robustly select relevant documents across complex queries, and the generator, which should faithfully synthesize responses. However, existing retrievers rely heavily on public knowledge and struggle with queries of varying logical complexity and clue completeness, while generators frequently face fidelity problems. In this work, we introduce RAGSynth, a framework that includes data construction modeling and a corresponding synthetic data generation implementation, designed to optimize retriever robustness and generator fidelity. Additionally, we present SynthBench, a benchmark encompassing 8 domain-specific documents across 4 domains, featuring diverse query complexities, clue completeness, and fine-grained citation granularity. Leveraging RAGSynth, we generate a large-scale synthetic dataset, including single-hop and multi-hop queries. Extensive experiments demonstrate that the synthetic data significantly improves the robustness of the retrievers and the fidelity of the generators. Additional evaluations confirm that RAGSynth can also generalize well across different domains. By integrating the optimized retrievers into various RAG paradigms, we consistently observe enhanced RAG system performance. We have open-sourced the implementation at https://github.com/EachSheep/RAGSynth.
摘要
RAG(检索增强生成)能够提升大语言模型在知识密集型任务中的表现。现有多种RAG范式(包括基础型、规划型和迭代型)均基于两个核心组件:检索器(需在复杂查询中稳健选择相关文档)和生成器(需忠实合成响应)。然而当前检索器过度依赖公共知识,难以应对不同逻辑复杂度与线索完整度的查询,而生成器则频繁面临保真度问题。本研究提出RAGSynth框架,包含数据构建建模与对应合成数据生成实现,旨在优化检索器鲁棒性与生成器保真度。我们同步推出SynthBench基准测试集,涵盖4个领域的8份专业文档,具有多样化查询复杂度、线索完整度及细粒度引用层级。基于RAGSynth生成的大规模合成数据集(含单跳与多跳查询)实验表明,合成数据显著提升了检索器的鲁棒性与生成器的保真度。额外评估证实RAGSynth具备良好的跨领域泛化能力。将优化后的检索器集成至各类RAG范式时,系统性能均获得持续提升。项目代码已开源:https://github.com/EachSheep/RAGSynth。
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning
Abstract
arXiv:2505.11049v1 Announce Type: new Abstract: To enhance the safety of VLMs, this paper introduces a novel reasoning-based VLM guard model dubbed GuardReasoner-VL. The core idea is to incentivize the guard model to deliberatively reason before making moderation decisions via online RL. First, we construct GuardReasoner-VLTrain, a reasoning corpus with 123K samples and 631K reasoning steps, spanning text, image, and text-image inputs. Then, based on it, we cold-start our model's reasoning ability via SFT. In addition, we further enhance reasoning regarding moderation through online RL. Concretely, to enhance diversity and difficulty of samples, we conduct rejection sampling followed by data augmentation via the proposed safety-aware data concatenation. Besides, we use a dynamic clipping parameter to encourage exploration in early stages and exploitation in later stages. To balance performance and token efficiency, we design a length-aware safety reward that integrates accuracy, format, and token cost. Extensive experiments demonstrate the superiority of our model. Remarkably, it surpasses the runner-up by 19.27% F1 score on average. We release data, code, and models (3B/7B) of GuardReasoner-VL at https://github.com/yueliu1999/GuardReasoner-VL/
摘要
为提升视觉语言模型(VLM)的安全性,本文提出了一种新型基于推理的VLM防护模型GuardReasoner-VL。其核心思想是通过在线强化学习激励防护模型在做出审核决策前进行审慎推理。首先,我们构建了包含12.3万样本和63.1万推理步骤的多模态推理语料库GuardReasoner-VLTrain,涵盖文本、图像及图文混合输入。基于该语料库,我们通过监督微调冷启动模型的推理能力,并进一步利用在线强化学习增强审核相关的推理能力。具体而言,为提升样本多样性和难度,我们采用拒绝采样策略并结合提出的安全感知数据拼接方法进行数据增强。此外,通过动态剪裁参数设计,在训练早期鼓励探索而后期侧重利用。为平衡性能与标记效率,我们设计了融合准确率、格式合规性和标记成本的长度感知安全奖励机制。大量实验验证了模型的优越性,其F1分数平均超越次优模型19.27%。我们在https://github.com/yueliu1999/GuardReasoner-VL/开源了GuardReasoner-VL的数据、代码及模型(3B/7B版本)。
MPS-Prover: Advancing Stepwise Theorem Proving by Multi-Perspective Search and Data Curation
Abstract
arXiv:2505.10962v1 Announce Type: new Abstract: Automated Theorem Proving (ATP) in formal languages remains a formidable challenge in AI, demanding rigorous logical deduction and navigating vast search spaces. While large language models (LLMs) have shown promising performance, existing stepwise provers often suffer from biased search guidance, leading to inefficiencies and suboptimal proof strategies. This paper introduces the Multi-Perspective Search Prover (MPS-Prover), a novel stepwise ATP system designed to overcome these limitations. MPS-Prover incorporates two key innovations: a highly effective post-training data curation strategy that prunes approximately 40% of redundant training data without sacrificing performance, and a multi-perspective tree search mechanism. This search integrates a learned critic model with strategically designed heuristic rules to diversify tactic selection, prevent getting trapped in unproductive states, and enhance search robustness. Extensive evaluations demonstrate that MPS-Prover achieves state-of-the-art performance on multiple challenging benchmarks, including miniF2F and ProofNet, outperforming prior 7B parameter models. Furthermore, our analyses reveal that MPS-Prover generates significantly shorter and more diverse proofs compared to existing stepwise and whole-proof methods, highlighting its efficiency and efficacy. Our work advances the capabilities of LLM-based formal reasoning and offers a robust framework and a comprehensive analysis for developing more powerful theorem provers.
摘要
形式化语言中的自动定理证明(ATP)始终是人工智能领域的一项艰巨挑战,其要求严格的逻辑推演能力并需应对庞大的搜索空间。尽管大语言模型(LLM)已展现出优异性能,现有逐步式证明器常因存在搜索导向偏差而导致效率低下与证明策略欠优。本文提出多视角搜索证明器(MPS-Prover),这一新型逐步式ATP系统旨在突破这些局限。MPS-Prover包含两项关键创新:其一为高效的后训练数据优化策略,可在保持性能前提下剔除约40%冗余训练数据;其二为多视角树搜索机制,该机制通过将学习型评判模型与策略性设计的启发式规则相结合,实现战术选择多样化、避免陷入无效状态并增强搜索鲁棒性。大量实验表明,MPS-Prover在miniF2F和ProofNet等多个高难度基准测试中达到最先进性能,优于此前70亿参数模型。进一步分析显示,相较于现有逐步式与整体式证明方法,MPS-Prover生成的证明过程显著更短且更具多样性,充分体现其高效性与优越性。本研究推动了基于LLM的形式推理能力发展,并为开发更强大的定理证明器提供了稳健框架与系统性分析。
Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction
Abstract
arXiv:2505.11063v1 Announce Type: new Abstract: LLM-based autonomous agents possess capabilities such as reasoning, tool invocation, and environment interaction, enabling the execution of complex multi-step tasks. The internal reasoning process, i.e., the thought, within a behavioral trajectory significantly influences tool usage and subsequent actions but can introduce potential risks. Even minor deviations in the agent's thought may trigger cascading effects leading to irreversible safety incidents. To address the safety alignment challenges in long-horizon behavioral trajectories, we propose Thought-Aligner, a plug-in dynamic thought correction module. Utilizing a lightweight and resource-efficient model, Thought-Aligner corrects each high-risk thought on the fly before each action execution. The corrected thought is then reintroduced to the agent, ensuring safer subsequent decisions and tool interactions. Importantly, Thought-Aligner modifies only the reasoning phase without altering the underlying agent framework, making it easy to deploy and widely applicable to various agent frameworks. To train the Thought-Aligner model, we construct an instruction dataset across ten representative scenarios and simulate ReAct execution trajectories, generating 5,000 diverse instructions and more than 11,400 safe and unsafe thought pairs. The model is fine-tuned using contrastive learning techniques. Experiments across three agent safety benchmarks involving 12 different LLMs demonstrate that Thought-Aligner raises agent behavioral safety from approximately 50% in the unprotected setting to 90% on average. Additionally, Thought-Aligner maintains response latency below 100ms with minimal resource usage, demonstrating its capability for efficient deployment, broad applicability, and timely responsiveness. This method thus provides a practical dynamic safety solution for LLM-based agents.
摘要
基于大语言模型(LLM)的自主智能体具备推理、工具调用与环境交互等能力,可执行复杂的多步骤任务。行为轨迹中的内部推理过程(即思维)会显著影响工具使用与后续行动,但也可能引入潜在风险。即使智能体思维出现微小偏差,也可能引发连锁反应导致不可逆的安全事故。针对长周期行为轨迹中的安全对齐挑战,本研究提出Thought-Aligner——一种插件式动态思维校正模块。该模块采用轻量级、低资源消耗的模型,在每项行动执行前实时修正高风险思维,并将校正后的思维重新注入智能体,从而确保后续决策与工具交互的安全性。值得注意的是,Thought-Aligner仅修改推理阶段而不改变底层智能体框架,使其易于部署并广泛适用于各类智能体框架。为训练Thought-Aligner模型,我们构建了涵盖十种典型场景的指令数据集,模拟ReAct执行轨迹,生成5,000条多样化指令及超过11,400组安全/不安全思维对,并采用对比学习技术进行模型微调。在包含12种不同LLM的三个智能体安全基准测试中,实验表明Thought-Aligner能将无防护状态下约50%的行为安全率提升至平均90%。此外,该模块在极低资源消耗下保持响应延迟低于100毫秒,展现出高效部署、广泛适用和即时响应的能力。该方法为基于LLM的智能体提供了实用的动态安全解决方案。
Time Travel is Cheating: Going Live with DeepFund for Real-Time Fund Investment Benchmarking
Abstract
arXiv:2505.11065v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated notable capabilities across financial tasks, including financial report summarization, earnings call transcript analysis, and asset classification. However, their real-world effectiveness in managing complex fund investment remains inadequately assessed. A fundamental limitation of existing benchmarks for evaluating LLM-driven trading strategies is their reliance on historical back-testing, which inadvertently enables LLMs to "time travel" by leveraging future information embedded in their training corpora, thus resulting in possible information leakage and overly optimistic performance estimates. To address this issue, we introduce DeepFund, a live fund benchmark tool designed to rigorously evaluate LLMs in real-time market conditions. Utilizing a multi-agent architecture, DeepFund connects directly with real-time stock market data (specifically, data published after each model's pretraining cutoff) to ensure fair and leakage-free evaluations. Empirical tests on nine flagship LLMs from leading global institutions across multiple investment dimensions, including ticker-level analysis, investment decision-making, portfolio management, and risk control, reveal significant practical challenges. Notably, even cutting-edge models such as DeepSeek-V3 and Claude-3.7-Sonnet incur net trading losses within the DeepFund real-time evaluation environment, underscoring the present limitations of LLMs for active fund management. Our code is available at https://github.com/HKUSTDial/DeepFund.
摘要
大型语言模型(LLMs)在金融任务中展现出显著能力,包括财务报告摘要、财报电话会议记录分析和资产分类等。然而,其在管理复杂基金投资中的实际有效性尚未得到充分评估。现有评估LLM驱动交易策略的基准存在根本性局限——依赖历史回测方法,这无意中使LLMs能够"时间穿越":利用训练语料中隐含的未来信息,从而导致潜在的信息泄露和过于乐观的性能预估。为解决该问题,我们推出DeepFund实时基金基准工具,旨在真实市场环境下严格评估LLMs。通过多智能体架构,DeepFund直接对接实时股市数据(特别采用各模型预训练截止日期后发布的数据),确保公平且无信息泄露的评估。针对全球顶尖机构的九款旗舰LLMs进行的实证测试(涵盖个股分析、投资决策、组合管理和风险控制等多维度)揭示了重大实践挑战。值得注意的是,即便是DeepSeek-V3和Claude-3.7-Sonnet等前沿模型,在DeepFund实时评估环境中也出现净交易亏损,这凸显了LLMs在主动型基金管理中的当前局限性。代码已开源:https://github.com/HKUSTDial/DeepFund。
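The leakage-free condition in this abstract, using only data published after each model's pretraining cutoff, amounts to a date filter. The sketch below illustrates that idea only; the record layout, ticker symbol, and cutoff date are hypothetical and are not taken from the DeepFund code base.

```python
from datetime import date
from typing import Dict, List


def leakage_free(records: List[Dict], pretraining_cutoff: date) -> List[Dict]:
    """Keep only records published strictly after the model's pretraining cutoff."""
    return [r for r in records if r["published"] > pretraining_cutoff]


# Hypothetical market records and a hypothetical cutoff date.
records = [
    {"ticker": "AAA", "published": date(2024, 12, 30), "close": 101.2},
    {"ticker": "AAA", "published": date(2024, 12, 31), "close": 100.7},
    {"ticker": "AAA", "published": date(2025, 1, 2), "close": 102.5},
]
cutoff = date(2024, 12, 31)
usable = leakage_free(records, cutoff)
print([r["published"].isoformat() for r in usable])
```

Because cutoffs differ between models, a filter of this kind would have to be applied per model, which is one reason a live benchmark is simpler than a historical back-test.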
Navigating the Alpha Jungle: An LLM-Powered MCTS Framework for Formulaic Factor Mining
Abstract
arXiv:2505.11122v1 Announce Type: new Abstract: Alpha factor mining is pivotal in quantitative investment for identifying predictive signals from complex financial data. While traditional formulaic alpha mining relies on human expertise, contemporary automated methods, such as those based on genetic programming or reinforcement learning, often suffer from search inefficiency or yield poorly interpretable alpha factors. This paper introduces a novel framework that integrates Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS) to overcome these limitations. Our approach leverages the LLM's instruction-following and reasoning capability to iteratively generate and refine symbolic alpha formulas within an MCTS-driven exploration. A key innovation is the guidance of MCTS exploration by rich, quantitative feedback from financial backtesting of each candidate factor, enabling efficient navigation of the vast search space. Furthermore, a frequent subtree avoidance mechanism is introduced to bolster search efficiency and alpha factor performance. Experimental results on real-world stock market data demonstrate that our LLM-based framework outperforms existing methods by mining alphas with superior predictive accuracy, trading performance, and improved interpretability, while offering a more efficient solution for formulaic alpha mining.
摘要
阿尔法因子挖掘在量化投资中对于从复杂金融数据中识别预测信号至关重要。传统公式化阿尔法挖掘依赖人工经验,而当代自动化方法(如基于遗传编程或强化学习的方法)常面临搜索效率低下或生成可解释性差的阿尔法因子等问题。本文提出一种创新框架,通过整合大语言模型(LLMs)与蒙特卡洛树搜索(MCTS)来克服这些局限。该方法利用LLM的指令遵循与推理能力,在MCTS驱动的探索中迭代生成并优化符号化阿尔法公式。关键创新在于通过候选因子金融回测提供的量化反馈来引导MCTS探索,从而实现对庞大搜索空间的高效遍历。此外,引入频繁子树规避机制以提升搜索效率与阿尔法因子表现。基于真实股市数据的实验结果表明,本框架通过挖掘具有更高预测精度、交易表现及增强可解释性的阿尔法因子,性能优于现有方法,同时为公式化阿尔法挖掘提供了更高效的解决方案。
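For readers unfamiliar with how Monte Carlo Tree Search balances exploring new formulas against refining promising ones, the standard UCT selection rule is sketched below. This is the textbook rule, not necessarily the one used in the paper. It assumes the reward stored at each node is some backtest score for a candidate alpha formula; the node fields, formula strings, and numbers are hypothetical.

```python
import math
from typing import Dict, List


def uct_score(total_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Mean reward plus an exploration bonus that shrinks as a node is visited more."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: List[Dict], parent_visits: int) -> Dict:
    """Pick the child with the highest UCT score."""
    return max(children, key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits))


# Hypothetical candidate formulas with accumulated backtest rewards.
children = [
    {"formula": "rank(close / delay(close, 5))", "reward": 1.2, "visits": 4},
    {"formula": "ts_mean(volume, 10) / volume", "reward": 0.5, "visits": 1},
]
chosen = select_child(children, parent_visits=5)
print(chosen["formula"])
```

In this example the less-visited second formula is selected because its exploration bonus outweighs the first formula's track record.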
Group Think: Multiple Concurrent Reasoning Agents Collaborating at Token Level Granularity
Abstract
arXiv:2505.11107v1 Announce Type: new Abstract: Recent advances in large language models (LLMs) have demonstrated the power of reasoning through self-generated chains of thought. Multiple reasoning agents can collaborate to raise joint reasoning quality above individual outcomes. However, such agents typically interact in a turn-based manner, trading increased latency for improved quality. In this paper, we propose Group Think: a single LLM that acts as multiple concurrent reasoning agents, or thinkers. With shared visibility into each other's partial generation progress, Group Think introduces a new concurrent-reasoning paradigm in which multiple reasoning trajectories adapt dynamically to one another at the token level. For example, a reasoning thread may shift its generation mid-sentence upon detecting that another thread is better positioned to continue. This fine-grained, token-level collaboration enables Group Think to reduce redundant reasoning and improve quality while achieving significantly lower latency. Moreover, its concurrent nature allows for efficient utilization of idle computational resources, making it especially suitable for edge inference, where very small batch sizes often underutilize local GPUs. We give a simple and generalizable modification that enables any existing LLM to perform Group Think on a local GPU. We also present an evaluation strategy to benchmark reasoning latency and empirically demonstrate latency improvements using open-source LLMs that were not explicitly trained for Group Think. We hope this work paves the way for future LLMs to exhibit more sophisticated and more efficient collaborative behavior for higher quality generation.
摘要
大语言模型(LLM)的最新进展展示了通过自生成思维链进行推理的能力。多个推理代理可以通过协作将联合推理质量提升至超越个体结果的水平。然而,此类代理通常以轮替方式交互,以增加延迟为代价换取质量提升。本文提出"群体思维"(Group Think)——由单个LLM模拟多个并发推理代理(或称思考者)。通过共享彼此部分生成进度的可见性,群体思维引入了一种新的并发推理范式,其中多个推理轨迹在词元级别动态相互适应。例如,当检测到另一线程更适合继续生成时,推理线程可能在句子中途改变其生成内容。这种细粒度的词元级协作使群体思维能够减少冗余推理并提升质量,同时显著降低延迟。此外,其并发特性允许高效利用闲置计算资源,特别适用于边缘推理场景——该场景下极小批量大小常导致本地GPU利用率不足。我们提出了一种简单且可泛化的修改方案,使现有LLM均能在本地GPU上实现群体思维。同时提出评估策略以基准测试推理延迟,并通过未经群体思维专门训练的开源LLM实证展示了延迟改进。我们希望这项工作能为未来LLM实现更复杂、更高效的协作行为以生成更优质内容开辟道路。
Feasibility with Language Models for Open-World Compositional Zero-Shot Learning
Abstract
arXiv:2505.11181v1 Announce Type: new Abstract: Humans can easily tell if an attribute (also called state) is realistic, i.e., feasible, for an object, e.g., fire can be hot, but it cannot be wet. In Open-World Compositional Zero-Shot Learning (OW-CZSL), when all possible state-object combinations are considered as unseen classes, zero-shot predictors tend to perform poorly. Our work focuses on using external auxiliary knowledge to determine the feasibility of state-object combinations. Our Feasibility with Language Model (FLM) is a simple and effective approach that leverages Large Language Models (LLMs) to better comprehend the semantic relationships between states and objects. FLM involves querying an LLM about the feasibility of a given pair and retrieving the output logit for the positive answer. To mitigate potential misguidance of the LLM given that many of the state-object compositions are rare or completely infeasible, we observe that the in-context learning ability of LLMs is essential. We present an extensive study identifying Vicuna and ChatGPT as the best-performing models, and we demonstrate that our FLM consistently improves OW-CZSL performance across all three benchmarks.
摘要
人类可以轻松判断某个属性(或称状态)对于物体是否真实可行,例如火可以是热的,但不能是湿的。在开放世界组合零样本学习中,当所有可能的状态-物体组合都被视为未见类别时,零样本预测器的表现往往欠佳。本研究重点利用外部辅助知识来确定状态-物体组合的可行性。我们提出的语言模型可行性评估方法(FLM)是一种简单有效的方案,通过利用大型语言模型(LLMs)来更好地理解状态与物体之间的语义关系。FLM的核心操作是向LLM查询给定组合的可行性,并获取肯定答案的输出逻辑值。考虑到许多状态-物体组合较为罕见或完全不可行可能误导LLM,我们发现LLM的上下文学习能力至关重要。我们通过广泛研究确定Vicuna和ChatGPT表现最佳,并证明FLM在全部三个基准测试中持续提升了开放世界组合零样本学习的性能。
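The querying step in this abstract (ask an LLM whether a state-object pair is feasible, then read off the logit of the positive answer) can be sketched without committing to any particular model API. Everything below is a hypothetical illustration: the prompt wording, the in-context examples (taken from the fire example in the abstract), the placeholder logits, and the choice to normalise the "yes" logit against a "no" logit with a softmax, which the abstract does not specify.

```python
import math
from typing import List, Tuple


def build_prompt(state: str, obj: str, examples: List[Tuple[str, str, bool]]) -> str:
    """Few-shot prompt: answered in-context examples followed by the unanswered query."""
    lines = []
    for ex_state, ex_obj, feasible in examples:
        answer = "yes" if feasible else "no"
        lines.append(f"Can {ex_obj} be {ex_state}? {answer}")
    lines.append(f"Can {obj} be {state}?")
    return "\n".join(lines)


def feasibility_score(yes_logit: float, no_logit: float) -> float:
    """Softmax probability of 'yes' against 'no' (an assumed normalisation)."""
    m = max(yes_logit, no_logit)
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)


examples = [("hot", "fire", True), ("wet", "fire", False)]
prompt = build_prompt("ripe", "a tomato", examples)
print(prompt)

# Placeholder logits standing in for what a real model would return for "yes" and "no".
score = feasibility_score(yes_logit=3.0, no_logit=-1.0)
print(round(score, 3))
```

A score of this kind could then be used to down-weight or mask infeasible compositions before the zero-shot predictor chooses among all state-object pairs.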
Can Global XAI Methods Reveal Injected Bias in LLMs? SHAP vs Rule Extraction vs RuleSHAP
Abstract
arXiv:2505.11189v1 Announce Type: new Abstract: Generative AI systems can help spread information but also misinformation and biases, potentially undermining the UN Sustainable Development Goals (SDGs). Explainable AI (XAI) aims to reveal the inner workings of AI systems and expose misbehaviours or biases. However, current XAI tools, built for simpler models, struggle to handle the non-numerical nature of large language models (LLMs). This paper examines the effectiveness of global XAI methods, such as rule-extraction algorithms and SHAP, in detecting bias in LLMs. To do so, we first show a text-to-ordinal mapping strategy to convert non-numerical inputs/outputs into numerical features, enabling these tools to identify (some) misinformation-related biases in LLM-generated content. Then, we inject non-linear biases of varying complexity (univariate, conjunctive, and non-convex) into widespread LLMs like ChatGPT and Llama via system instructions, using global XAI methods to detect them. This way, we found that RuleFit struggles with conjunctive and non-convex biases, while SHAP can approximate conjunctive biases but cannot express them as actionable rules. Hence, we introduce RuleSHAP, a global rule extraction algorithm combining SHAP and RuleFit to detect more non-univariate biases, improving injected bias detection over RuleFit by +94% (MRR@1) on average.
摘要
生成式人工智能系统在传播信息的同时也可能助长错误信息和偏见,从而可能破坏联合国可持续发展目标(SDGs)。可解释人工智能(XAI)旨在揭示AI系统的内部运作机制并暴露其不当行为或偏见。然而,当前为简单模型设计的XAI工具难以处理大型语言模型(LLMs)的非数值特性。本文研究了全局XAI方法(如规则提取算法和SHAP)在检测LLMs偏见方面的有效性。为此,我们首先提出一种文本到序数的映射策略,将非数值输入/输出转换为数值特征,使这些工具能够识别LLM生成内容中(部分)与错误信息相关的偏见。接着,我们通过系统指令向ChatGPT和Llama等主流LLMs注入不同复杂度(单变量、合取和非凸)的非线性偏见,并利用全局XAI方法进行检测。研究发现,RuleFit难以处理合取和非凸偏见,而SHAP虽能近似识别合取偏见却无法将其转化为可操作规则。为此,我们提出RuleSHAP算法——一种结合SHAP与RuleFit的全局规则提取方法,可检测更多非单变量偏见,其注入偏见的检测性能较RuleFit平均提升94%(MRR@1指标)。
Prot2Text-V2: Protein Function Prediction with Multimodal Contrastive Alignment
Abstract
arXiv:2505.11194v1 Announce Type: new Abstract: Predicting protein function from sequence is a central challenge in computational biology. While existing methods rely heavily on structured ontologies or similarity-based techniques, they often lack the flexibility to express structure-free functional descriptions and novel biological functions. In this work, we introduce Prot2Text-V2, a novel multimodal sequence-to-text model that generates free-form natural language descriptions of protein function directly from amino acid sequences. Our method combines a protein language model as a sequence encoder (ESM-3B) and a decoder-only language model (LLaMA-3.1-8B-Instruct) through a lightweight nonlinear modality projector. A key innovation is our Hybrid Sequence-level Contrastive Alignment Learning (H-SCALE), which improves cross-modal learning by matching mean- and std-pooled protein embeddings with text representations via contrastive loss. After the alignment phase, we apply instruction-based fine-tuning using LoRA on the decoder to teach the model how to generate accurate protein function descriptions conditioned on the protein sequence. We train Prot2Text-V2 on about 250K curated entries from SwissProt and evaluate it under low-homology conditions, where test sequences have low similarity with training samples. Prot2Text-V2 consistently outperforms traditional and LLM-based baselines across various metrics.
摘要
预测蛋白质功能是计算生物学领域的核心挑战。现有方法主要依赖结构化本体论或基于相似性的技术,往往难以灵活表达非结构化的功能描述和新型生物学功能。本研究提出Prot2Text-V2——一种新型多模态序列到文本模型,可直接从氨基酸序列生成自由形式的蛋白质功能自然语言描述。该方法通过轻量级非线性模态投影器,将蛋白质语言模型(ESM-3B)作为序列编码器与仅解码语言模型(LLaMA-3.1-8B-Instruct)相结合。关键创新在于混合序列级对比对齐学习(H-SCALE),通过对比损失将均值池化和标准差池化的蛋白质嵌入与文本表示进行匹配,从而提升跨模态学习效果。在完成对齐阶段后,我们采用基于指令的LoRA微调方法训练解码器,使模型能够根据蛋白质序列生成准确的功能描述。Prot2Text-V2在SwissProt约25万条精选条目上完成训练,并在低同源条件下(测试序列与训练样本相似度低)进行评估。实验结果表明,该模型在各项指标上均优于传统方法和基于大语言模型的基线系统。
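The alignment idea in the Prot2Text-V2 entry above (mean- and std-pooled protein embeddings matched to text representations with a contrastive loss) can be sketched in plain Python. The details are assumptions made for illustration: the two pooled vectors are concatenated, protein and text vectors are taken to have the same dimension already, and the loss is a one-directional InfoNCE with cosine similarity. The abstract does not specify any of these choices.

```python
import math
from statistics import fmean, pstdev

def mean_std_pool(token_embeddings):
    """Concatenate the per-dimension mean and population std over the sequence."""
    dims = list(zip(*token_embeddings))   # transpose: one tuple per embedding dimension
    return [fmean(d) for d in dims] + [pstdev(d) for d in dims]

def cosine(u, v):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(protein_vecs, text_vecs, temperature=0.1):
    """Mean cross-entropy of matching protein i to text i within the batch."""
    total = 0.0
    for i, p in enumerate(protein_vecs):
        logits = [cosine(p, t) / temperature for t in text_vecs]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(protein_vecs)

pooled = mean_std_pool([[1.0, 2.0], [3.0, 2.0]])   # two residues, two dimensions
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each protein vector is closest to its own text vector and large when the pairing is swapped, which is the signal an alignment phase of this kind would train on.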
SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning
Abstract
arXiv:2505.11274v1 Announce Type: new Abstract: Recently, large reasoning models have demonstrated exceptional performance on various tasks. However, reasoning models inefficiently over-process both trivial and complex queries, leading to resource waste and prolonged user latency. To address this challenge, we propose SelfBudgeter, a self-adaptive controllable reasoning strategy for efficient reasoning. Our approach adopts a dual-phase training paradigm: first, the model learns to pre-estimate the reasoning cost based on the difficulty of the query. Then, we introduce budget-guided GRPO for reinforcement learning, which effectively maintains accuracy while reducing output length. SelfBudgeter allows users to anticipate generation time and make informed decisions about continuing or interrupting the process. Furthermore, our method enables direct manipulation of reasoning length by pre-filling the token budget. Experimental results demonstrate that SelfBudgeter can rationally allocate budgets according to problem complexity, achieving up to 74.47% response length compression on the MATH benchmark while maintaining nearly undiminished accuracy.
摘要
近期,大型推理模型在各类任务中展现出卓越性能。然而,这些模型在处理简单与复杂查询时均存在低效过度处理现象,导致资源浪费和用户延迟增加。为应对这一挑战,我们提出SelfBudgeter——一种自适应可控的高效推理策略。该方法采用双阶段训练范式:首先,模型学习基于查询难度预先估算推理成本;其次,我们引入预算引导的GRPO强化学习方法,在保持精度的同时有效缩减输出长度。SelfBudgeter使用户能够预判生成时间,并据此做出继续或中断过程的决策。此外,该方法支持通过预填充令牌预算直接调控推理长度。实验结果表明,SelfBudgeter能根据问题复杂度合理分配预算,在MATH基准测试中实现最高74.47%的响应长度压缩,同时保持几乎无损的准确率。
LD-Scene: LLM-Guided Diffusion for Controllable Generation of Adversarial Safety-Critical Driving Scenarios
Abstract
arXiv:2505.11247v1 Announce Type: new Abstract: Ensuring the safety and robustness of autonomous driving systems necessitates a comprehensive evaluation in safety-critical scenarios. However, these safety-critical scenarios are rare and difficult to collect from real-world driving data, posing significant challenges to effectively assessing the performance of autonomous vehicles. Typical existing methods often suffer from limited controllability and lack user-friendliness, as extensive expert knowledge is essentially required. To address these challenges, we propose LD-Scene, a novel framework that integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) for user-controllable adversarial scenario generation through natural language. Our approach comprises an LDM that captures realistic driving trajectory distributions and an LLM-based guidance module that translates user queries into adversarial loss functions, facilitating the generation of scenarios aligned with user queries. The guidance module integrates an LLM-based Chain-of-Thought (CoT) code generator and an LLM-based code debugger, enhancing the controllability and robustness in generating guidance functions. Extensive experiments conducted on the nuScenes dataset demonstrate that LD-Scene achieves state-of-the-art performance in generating realistic, diverse, and effective adversarial scenarios. Furthermore, our framework provides fine-grained control over adversarial behaviors, thereby facilitating more effective testing tailored to specific driving scenarios.
摘要
确保自动驾驶系统的安全性和鲁棒性需要在安全关键场景中进行全面评估。然而,这类安全关键场景在现实驾驶数据中极为罕见且难以采集,这对有效评估自动驾驶车辆性能构成了重大挑战。现有典型方法通常存在可控性有限和用户友好性不足的问题,因其本质上需要大量专家知识。为解决这些挑战,我们提出了LD-Scene——一个将大语言模型(LLMs)与潜在扩散模型(LDMs)相结合的新型框架,通过自然语言实现用户可控的对抗场景生成。该框架包含一个捕捉真实驾驶轨迹分布的LDM,以及一个基于LLM的引导模块,该模块将用户查询转化为对抗性损失函数,从而生成符合用户需求的场景。引导模块整合了基于LLM的思维链(CoT)代码生成器和基于LLM的代码调试器,提升了生成引导函数的可控性和鲁棒性。在nuScenes数据集上进行的大量实验表明,LD-Scene在生成真实、多样且有效的对抗场景方面达到了最先进水平。此外,我们的框架提供了对对抗行为的细粒度控制,从而能够针对特定驾驶场景开展更有效的测试。
Is PRM Necessary? Problem-Solving RL Implicitly Induces PRM Capability in LLMs
Abstract
arXiv:2505.11227v1 Announce Type: new Abstract: The development of reasoning capabilities represents a critical frontier in large language model (LLM) research, where reinforcement learning (RL) and process reward models (PRMs) have emerged as predominant methodological frameworks. Contrary to conventional wisdom, empirical evidence from DeepSeek-R1 demonstrates that pure RL training focused on mathematical problem-solving can progressively enhance reasoning abilities without PRM integration, challenging the perceived necessity of process supervision. In this study, we conduct a systematic investigation of the relationship between RL training and PRM capabilities. Our findings demonstrate that problem-solving proficiency and process supervision capabilities represent complementary dimensions of reasoning that co-evolve synergistically during pure RL training. In particular, current PRMs underperform simple baselines like majority voting when applied to state-of-the-art models such as DeepSeek-R1 and QwQ-32B. To address this limitation, we propose Self-PRM, an introspective framework in which models autonomously evaluate and rerank their generated solutions through self-reward mechanisms. Although Self-PRM consistently improves benchmark accuracy (particularly with larger sample sizes), analysis exposes persistent challenges: the approach exhibits low precision (<10%) on difficult problems, frequently misclassifying flawed solutions as valid. These analyses underscore the need for continued RL scaling to improve reward alignment and introspective accuracy. Overall, our findings suggest that PRM may not be essential for enhancing complex reasoning, as pure RL not only improves problem-solving skills but also inherently fosters robust PRM capabilities. We hope these findings provide actionable insights for building more reliable and self-aware complex reasoning models.
摘要
推理能力的发展是大型语言模型(LLM)研究的关键前沿,其中强化学习(RL)和过程奖励模型(PRM)已成为主流方法论框架。与传统观点相反,DeepSeek-R1的实证研究表明,专注于数学问题解决的纯RL训练无需整合PRM即可逐步提升推理能力,这一发现对过程监督的必要性提出了挑战。本研究系统性地探讨了RL训练与PRM能力之间的关系,发现解题能力与过程监督能力是推理的两个互补维度,在纯RL训练过程中会协同演化。值得注意的是,当应用于DeepSeek-R1和QwQ-32B等前沿模型时,现有PRM的表现甚至不及多数投票等简单基线方法。为突破这一局限,我们提出Self-PRM框架——该自省机制使模型通过自我奖励机制自主评估并重新排序生成的解决方案。尽管Self-PRM能持续提升基准测试准确率(尤其在更大样本量时),分析仍揭示出持续存在的挑战:该方法在难题上表现出的精确度较低(<10%),经常将存在缺陷的解决方案误判为有效。这些分析表明需要持续扩展RL规模以改进奖励对齐和自省准确性。总体而言,我们的研究结果表明PRM对于增强复杂推理可能并非必需,因为纯RL不仅能提升问题解决能力,还能内生地培育强大的PRM能力。希望这些发现能为构建更可靠、更具自我意识的复杂推理模型提供可操作的见解。
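The two answer-selection strategies contrasted in the entry above, majority voting and reranking by the model's own scores, can be sketched in a few lines. This is a simplified illustration only: the sampled answers and self-assigned scores below are hypothetical, and the reranking function is a stand-in for the idea of self-reward reranking rather than the paper's Self-PRM procedure.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

def rerank_by_self_score(answers, scores):
    """Return the answer with the highest self-assigned score."""
    best_answer, _ = max(zip(answers, scores), key=lambda pair: pair[1])
    return best_answer

# Hypothetical final answers from four sampled solutions to one problem,
# with hypothetical scores the model assigned to its own solutions.
sampled_answers = ["42", "41", "42", "7"]
self_scores = [0.2, 0.9, 0.3, 0.1]

voted = majority_vote(sampled_answers)
reranked = rerank_by_self_score(sampled_answers, self_scores)
```

The toy data shows how the two strategies can disagree: one confidently scored but flawed solution overrides the majority, which mirrors the low-precision failure mode the abstract reports on difficult problems.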
TAIJI: MCP-based Multi-Modal Data Analytics on Data Lakes
Abstract
arXiv:2505.11270v1 Announce Type: new Abstract: The variety of data in data lakes presents significant challenges for data analytics, as data scientists must simultaneously analyze multi-modal data, including structured, semi-structured, and unstructured data. While Large Language Models (LLMs) have demonstrated promising capabilities, they remain inadequate for multi-modal data analytics in terms of accuracy, efficiency, and freshness. First, current natural language (NL) or SQL-like query languages may struggle to precisely and comprehensively capture users' analytical intent. Second, relying on a single unified LLM to process diverse data modalities often leads to substantial inference overhead. Third, data stored in data lakes may be incomplete or outdated, making it essential to integrate external open-domain knowledge to generate timely and relevant analytics results. In this paper, we envision a new multi-modal data analytics system. Specifically, we propose a novel architecture built upon the Model Context Protocol (MCP), an emerging paradigm that enables LLMs to collaborate with knowledgeable agents. First, we define a semantic operator hierarchy tailored for querying multi-modal data in data lakes and develop an AI-agent-powered NL2Operator translator to bridge user intent and analytical execution. Next, we introduce an MCP-based execution framework, in which each MCP server hosts specialized foundation models optimized for specific data modalities. This design enhances both accuracy and efficiency, while supporting high scalability through modular deployment. Finally, we propose an updating mechanism that harnesses deep research and machine unlearning techniques to refresh the data lakes and LLM knowledge, with the goal of balancing data freshness and inference efficiency.
摘要
数据湖中数据的多样性给数据分析带来了重大挑战,数据科学家需要同时分析包括结构化、半结构化和非结构化数据在内的多模态数据。尽管大语言模型(LLMs)已展现出良好的能力,但在准确性、效率和时效性方面仍不足以满足多模态数据分析的需求。首先,当前的自然语言(NL)或类SQL查询语言可能难以精确且全面地捕捉用户的分析意图;其次,依赖单一统一的大语言模型处理多样化的数据模态通常会导致显著的推理开销;第三,数据湖中存储的数据可能存在不完整或过时问题,因此必须整合外部开放域知识以生成及时相关的分析结果。
本文提出了一种新型多模态数据分析系统。具体而言,我们设计了一种基于模型上下文协议(MCP)的创新架构,该新兴范式可使大语言模型与知识代理协同工作。首先,我们定义了专为查询数据湖多模态数据设计的语义操作符层次结构,并开发了由AI代理驱动的自然语言到操作符转换器(NL2Operator),以桥接用户意图与分析执行。其次,我们提出了基于MCP的执行框架,其中每个MCP服务器托管针对特定数据模态优化的专用基础模型,该设计在提升准确性和效率的同时,通过模块化部署支持高度可扩展性。最后,我们通过深度研究和机器遗忘技术构建更新机制,以刷新数据湖和大语言模型知识,旨在平衡数据时效性与推理效率。
TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Abstract
arXiv:2505.11329v1 Announce Type: new Abstract: Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLINK. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Further, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The computation of one subset is then overlapped with the communication of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce-RMSNorm kernel carefully leveraging Multimem instruction support available on NVIDIA Hopper GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory bound RMSNorm to be overlapped with the other batch's computation, providing additional gains. Our evaluations demonstrate up to 29% latency gains and up to 26% throughput gains across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
摘要
大型语言模型(LLM)的分布式推理即使在通过NVLINK等高速互连连接的GPU上也会产生高达20%的开销。目前已提出多种技术通过将计算分解为更细粒度的任务,并在子任务完成时重叠通信来缓解这些开销。然而,在GPU上将大规模计算细粒度分解为多个小计算会产生额外开销。此外,通信本身会占用大量流式多处理器(SM),进一步增加了开销。
我们提出TokenWeave来解决这些挑战。TokenWeave采用了一种令牌分割技术,以波感知方式将推理批次的令牌划分为两个近似相等的子集,使一个子集的计算与另一个子集的通信重叠执行。此外,TokenWeave优化了层归一化计算相对于通信操作的执行顺序,并实现了一种新颖的融合式AllReduce-RMSNorm内核,充分利用NVIDIA Hopper GPU的多存储器指令支持。这些优化使TokenWeave仅需2-8个SM即可完成通信和RMSNorm操作。我们的内核还实现了内存受限的RMSNorm与其他批次计算的重叠,从而获得额外收益。评估结果表明,在多种模型和工作负载下,TokenWeave可实现高达29%的延迟降低和26%的吞吐量提升。在多个场景中,TokenWeave的性能甚至优于移除了所有通信的等效模型。
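The Token-Splitting step described in the TokenWeave entry above (dividing a batch's tokens into two approximately equal subsets so one subset's computation overlaps the other's communication) can be sketched as a small sizing function. The "wave-aware" rule used here, rounding the first subset to a multiple of a fixed tile size and leaving small batches unsplit, is an assumption made for illustration, and the `tile` value of 128 is hypothetical. The abstract does not give the actual rule.

```python
def split_tokens(num_tokens: int, tile: int = 128):
    """Split a token batch into two near-equal parts.

    The first part is rounded to a multiple of `tile` (a hypothetical stand-in
    for wave awareness); batches smaller than two tiles are left unsplit.
    """
    if num_tokens < 2 * tile:
        return num_tokens, 0
    first = max(tile, round(num_tokens / 2 / tile) * tile)
    return first, num_tokens - first

# With two subsets, subset A can be in its AllReduce while subset B computes,
# and the two swap roles at the next layer boundary.
first, second = split_tokens(1000)
```

Rounding to a tile multiple trades a slightly uneven split for kernels that fill whole waves of thread blocks, which is the kind of quantization effect a wave-aware split would aim to avoid.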
LLM-Explorer: Towards Efficient and Affordable LLM-based Exploration for Mobile Apps
Abstract
arXiv:2505.10593v1 Announce Type: cross Abstract: Large language models (LLMs) have opened new opportunities for automated mobile app exploration, an important and challenging problem that used to suffer from the difficulty of generating meaningful UI interactions. However, existing LLM-based exploration approaches rely heavily on LLMs to generate actions in almost every step, leading to a huge cost in token fees and computational resources. We argue that such extensive usage of LLMs is neither necessary nor effective, since many actions during exploration do not require, or may even be biased by, the abilities of LLMs. Further, based on the insight that precise and compact knowledge plays the central role in effective exploration, we introduce LLM-Explorer, a new exploration agent designed for efficiency and affordability. LLM-Explorer uses LLMs primarily for maintaining the knowledge instead of generating actions, and the knowledge is used to guide action generation in an LLM-less manner. Based on a comparison with 5 strong baselines on 20 typical apps, LLM-Explorer was able to achieve the fastest and highest coverage among all automated app explorers, with over 148x lower cost than the state-of-the-art LLM-based approach.
摘要
大语言模型(LLMs)为自动化移动应用探索开辟了新途径,这一重要且具有挑战性的问题曾因难以生成有意义的用户界面交互而受阻。然而,现有基于LLM的探索方法几乎每一步都严重依赖LLM生成操作,导致高昂的令牌费用和计算资源消耗。我们认为这种对LLM的过度使用既不必要也不高效,因为探索过程中的许多操作并不需要LLM参与,甚至可能受LLM能力影响而产生偏差。进一步地,基于"精确而紧凑的知识是有效探索核心"这一洞见,我们提出了LLM-Explorer——一种兼顾高效性与经济性的新型探索智能体。该智能体主要利用LLM维护知识而非生成操作,并通过非LLM方式利用知识指导操作生成。在与20款典型应用程序上5个强基线的对比实验中,LLM-Explorer在所有自动化应用探索器中实现了最快速度和最高覆盖率,其成本较最先进的基于LLM的方法降低了148倍以上。
Two Minds Better Than One: Collaborative Reward Modeling for LLM Alignment
Abstract
arXiv:2505.10597v1 Announce Type: cross Abstract: Reward models (RMs) are essential for aligning large language models (LLMs) with human values. However, noisy preferences in human feedback often lead to reward misgeneralization, where RMs overfit to spurious patterns and provide misleading signals during policy optimization. We systematically analyze the training dynamics of preference pairs and identify that noisy examples are harder to fit and introduce instability. Empirical evidence shows that LLMs optimized using reward models trained on full noisy datasets perform worse than those trained on filtered, high-quality preferences. To address this, we propose Collaborative Reward Modeling (CRM), an online framework that enhances robustness by combining peer review and curriculum learning. Two reward models are trained in parallel and assess each other's data selections to filter out potential noise. Curriculum learning structures the preference data from easy to hard, ensuring synchronized training and stable feedback. Extensive experiments demonstrate that CRM improves generalization, with up to 9.94 points of accuracy gain on RewardBench under 40 percent label noise. CRM is also compatible with implicit-reward alignment methods, offering a practical and versatile strategy for robust alignment.
摘要
奖励模型(RMs)对于将大语言模型(LLMs)与人类价值观对齐至关重要。然而,人类反馈中的噪声偏好常导致奖励错误泛化,即奖励模型过度拟合虚假模式并在策略优化过程中产生误导性信号。我们系统分析了偏好对的训练动态,发现噪声样本更难拟合且会引入不稳定性。实证研究表明,使用基于完整噪声数据集训练的奖励模型优化的LLMs,其表现逊色于基于过滤后高质量偏好训练的模型。为此,我们提出协同奖励建模(CRM),这是一个通过结合同行评审和课程学习来增强鲁棒性的在线框架。两个奖励模型并行训练并相互评估数据选择以过滤潜在噪声。课程学习将偏好数据按从易到难的结构组织,确保同步训练和稳定反馈。大量实验表明,CRM在40%标签噪声下可使RewardBench准确率提升高达9.94个百分点,显著改善了泛化能力。该框架还与隐式奖励对齐方法兼容,为稳健对齐提供了实用且通用的策略。
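The peer-review step in the CRM entry above (two reward models that assess each other's data selections to filter noisy preference pairs) can be sketched with a co-teaching-style small-loss rule. That rule is swapped in here as an assumption: each model keeps the pairs it fits with the lowest loss and hands that selection to its peer. The loss values are made up, and the abstract does not state the actual selection criterion.

```python
def select_low_loss(losses, keep):
    """Indices of the `keep` preference pairs with the lowest loss, in index order."""
    ranked = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(ranked[:keep])

def peer_review_round(losses_a, losses_b, keep):
    """Each model's low-loss selection becomes the other model's training set."""
    train_set_for_b = select_low_loss(losses_a, keep)
    train_set_for_a = select_low_loss(losses_b, keep)
    return train_set_for_a, train_set_for_b

# Hypothetical per-pair losses from reward models A and B on five preference pairs.
losses_a = [0.1, 2.5, 0.3, 0.2, 1.9]
losses_b = [0.2, 0.4, 2.2, 0.1, 2.0]

train_a, train_b = peer_review_round(losses_a, losses_b, keep=3)
```

A curriculum schedule, the other ingredient the abstract names, could be layered on top by raising `keep` over training so that harder pairs are admitted later.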
CRPE: Expanding The Reasoning Capability of Large Language Model for Code Generation
Abstract
arXiv:2505.10594v1 Announce Type: cross Abstract: We introduce CRPE (Code Reasoning Process Enhancer), an innovative three-stage framework for data synthesis and model training that advances the development of sophisticated code reasoning capabilities in large language models (LLMs). Building upon existing system-1 models, CRPE addresses the fundamental challenge of enhancing LLMs' analytical and logical processing in code generation tasks. Our framework presents a methodologically rigorous yet implementable approach to cultivating advanced code reasoning abilities in language models. Through the implementation of CRPE, we successfully develop an enhanced COT-Coder that demonstrates marked improvements in code generation tasks. Evaluation results on LiveCodeBench (20240701-20240901) demonstrate that our COT-Coder-7B-StepDPO, derived from Qwen2.5-Coder-7B-Base, with a pass@1 accuracy of 21.88, exceeds all models with similar or even larger sizes. Furthermore, our COT-Coder-32B-StepDPO, based on Qwen2.5-Coder-32B-Base, exhibits superior performance with a pass@1 accuracy of 35.08, outperforming GPT4O on the benchmark. Overall, CRPE represents a comprehensive, open-source method that encompasses the complete pipeline from instruction data acquisition through expert code reasoning data synthesis, culminating in an autonomous reasoning enhancement mechanism.
摘要
我们提出CRPE(代码推理过程增强器),这是一种创新的三阶段数据合成与模型训练框架,旨在提升大语言模型(LLMs)的复杂代码推理能力。基于现有系统1模型,CRPE解决了增强LLMs在代码生成任务中分析与逻辑处理能力的核心挑战。该框架提供了一种方法严谨且可实施的途径,用于培养语言模型的高级代码推理能力。通过实施CRPE,我们成功开发出增强版COT-Coder,其在代码生成任务中表现出显著提升。在LiveCodeBench(20240701-20240901)的评估结果显示,基于Qwen2.5-Coder-7B-Base的COT-Coder-7B-StepDPO以21.88的pass@1准确率超越所有同规模甚至更大规模的模型。此外,基于Qwen2.5-Coder-32B-Base的COT-Coder-32B-StepDPO展现出更优异的性能,其35.08的pass@1准确率在基准测试中超越了GPT4O。总体而言,CRPE代表了一种全面的开源方法,涵盖从指令数据获取到专家级代码推理数据合成的完整流程,最终形成自主推理增强机制。
Understanding Gen Alpha Digital Language: Evaluation of LLM Safety Systems for Content Moderation
Abstract
arXiv:2505.10588v1 Announce Type: cross Abstract: This research offers a unique evaluation of how AI systems interpret the digital language of Generation Alpha (Gen Alpha, born 2010-2024). As the first cohort raised alongside AI, Gen Alpha faces new forms of online risk due to immersive digital engagement and a growing mismatch between their evolving communication and existing safety tools. Their distinct language, shaped by gaming, memes, and AI-driven trends, often conceals harmful interactions from both human moderators and automated systems. We assess four leading AI models (GPT-4, Claude, Gemini, and Llama 3) on their ability to detect masked harassment and manipulation within Gen Alpha discourse. Using a dataset of 100 recent expressions from gaming platforms, social media, and video content, the study reveals critical comprehension failures with direct implications for online safety. This work contributes: (1) a first-of-its-kind dataset capturing Gen Alpha expressions; (2) a framework to improve AI moderation systems for youth protection; (3) a multi-perspective evaluation including AI systems, human moderators, and parents, with direct input from Gen Alpha co-researchers; and (4) an analysis of how linguistic divergence increases youth vulnerability. Findings highlight the urgent need to redesign safety systems attuned to youth communication, especially given Gen Alpha reluctance to seek help when adults fail to understand their digital world. This study combines the insight of a Gen Alpha researcher with systematic academic analysis to address critical digital safety challenges.
摘要
本研究对人工智能系统如何解读α世代(Gen Alpha,2010-2024年出生群体)的数字语言进行了创新性评估。作为与AI共同成长的首个世代,α世代因深度数字参与及不断演变的沟通方式与现有安全工具之间的脱节,正面临新型网络风险。其由游戏、网络迷因和AI驱动趋势塑造的独特语言,往往使人类审核员与自动化系统都难以察觉有害互动。我们评估了四种主流AI模型(GPT-4、Claude、Gemini和Llama 3)在识别α世代话语中隐蔽骚扰与操控行为的能力。通过分析来自游戏平台、社交媒体和视频内容的100条最新表达数据集,研究揭示了直接影响网络安全的重大理解缺陷。本研究的贡献包括:(1)首个记录α世代表达特征的数据集;(2)改进青少年保护AI审核系统的框架;(3)涵盖AI系统、人类审核员及家长的多视角评估,并包含α世代合作研究者的直接反馈;(4)关于语言差异如何加剧青少年脆弱性的分析。研究结果强调:鉴于α世代在成年人无法理解其数字世界时往往不愿寻求帮助,亟需重新设计适应青少年沟通特点的安全系统。本研究结合α世代研究者的洞见与系统化学术分析,以应对关键的数字安全挑战。
Large Language Models for Cancer Communication: Evaluating Linguistic Quality, Safety, and Accessibility in Generative AI
Abstract
arXiv:2505.10472v1 Announce Type: cross Abstract: Effective communication about breast and cervical cancers remains a persistent health challenge, with significant gaps in public understanding of cancer prevention, screening, and treatment, potentially leading to delayed diagnoses and inadequate treatments. This study evaluates the capabilities and limitations of Large Language Models (LLMs) in generating accurate, safe, and accessible cancer-related information to support patient understanding. We evaluated five general-purpose and three medical LLMs using a mixed-methods evaluation framework across linguistic quality, safety and trustworthiness, and communication accessibility and affectiveness. Our approach utilized quantitative metrics, qualitative expert ratings, and statistical analysis using Welch's ANOVA, Games-Howell, and Hedges' g. Our results show that general-purpose LLMs produced outputs of higher linguistic quality and affectiveness, while medical LLMs demonstrate greater communication accessibility. However, medical LLMs tend to exhibit higher levels of potential harm, toxicity, and bias, reducing their performance in safety and trustworthiness. Our findings indicate a duality between domain-specific knowledge and safety in health communications. The results highlight the need for intentional model design with targeted improvements, particularly in mitigating harm and bias, and improving safety and affectiveness. This study provides a comprehensive evaluation of LLMs for cancer communication, offering critical insights for improving AI-generated health content and informing future development of accurate, safe, and accessible digital health tools.
摘要
关于乳腺癌和宫颈癌的有效传播仍是持续存在的健康挑战,公众对癌症预防、筛查和治疗的理解存在显著差距,可能导致延误诊断和治疗不足。本研究评估了大型语言模型(LLMs)在生成准确、安全且易于理解的癌症相关信息以支持患者认知方面的能力与局限。我们采用混合方法评估框架,从语言质量、安全可信度、传播可及性与情感效应三个维度,对五个通用LLMs和三个医学专用LLMs进行了评估。方法结合定量指标、定性专家评分及韦尔奇方差分析、Games-Howell检验和Hedges' g统计量。结果表明:通用LLMs在语言质量和情感效应上表现更优,而医学LLMs则展现出更强的传播可及性。然而,医学LLMs往往存在更高水平的潜在危害性、毒性及偏见,降低了其安全可信度表现。研究发现揭示了健康传播中领域专业知识与安全性之间的二元性,强调需要针对性地改进模型设计,特别是在减少危害偏见、提升安全性与情感效应方面。本研究为癌症信息传播的LLMs应用提供了全面评估,为改进AI生成健康内容及开发准确、安全、可及的数字健康工具提供了关键洞见。
MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices
Abstract
arXiv:2505.10607v1 Announce Type: cross Abstract: The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM's understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.
摘要
智能手机和物联网设备的日益普及,使得在资源受限的硬件上进行高效时间序列分析变得至关重要,这对人体活动识别和空气质量预测等传感应用尤为关键。尽管当前硬件感知的神经架构搜索(NAS)技术能针对特定平台自动发现架构,但尚未有研究专注于面向边缘部署的通用时间序列分析。本研究利用大语言模型(LLM)的问题解决与推理能力,提出创新框架MONAQ,将NAS重构为多目标神经架构查询任务。该框架配备多模态查询生成功能,可处理多模态时间序列输入与硬件约束,并通过基于LLM智能体的多目标搜索实现代码生成的部署就绪模型。通过整合数值数据、时间序列图像和文本描述,MONAQ显著提升了LLM对时间序列数据的理解能力。在十五个数据集上的实验表明,MONAQ发现的模型性能优于手工构建模型和NAS基线,同时具有更高的效率。
Towards an LLM-powered Social Digital Twinning Platform
Abstract
arXiv:2505.10681v1 Announce Type: cross Abstract: We present Social Digital Twinner, an innovative social simulation tool for exploring plausible effects of what-if scenarios in complex adaptive social systems. The architecture is composed of three seamlessly integrated parts: a data infrastructure featuring real-world data and a multi-dimensionally representative synthetic population of citizens, an LLM-enabled agent-based simulation engine, and a user interface that enables intuitive, natural language interactions with the simulation engine and the artificial agents (i.e. citizens). Social Digital Twinner facilitates real-time engagement and empowers stakeholders to collaboratively design, test, and refine intervention measures. The tool thereby promotes a data-driven and evidence-based approach to societal problem-solving. We demonstrate the tool's interactive capabilities by addressing the critical issue of youth school dropouts in Kragero, Norway, showcasing its ability to create and execute a dedicated social digital twin using natural language.
摘要
我们提出"社会数字孪生体"——一种创新的社会模拟工具,用于探索复杂自适应社会系统中假设情景的潜在影响。该架构由三个无缝集成的部分组成:包含真实世界数据和多维代表性合成人口的数据基础设施、基于大语言模型的智能体仿真引擎,以及支持用户通过自然语言与仿真引擎和人工智能体(即公民)进行直观交互的界面。社会数字孪生体支持实时参与,使利益相关者能够协作设计、测试和完善干预措施。该方法推动了一种数据驱动、循证决策的社会问题解决途径。我们以挪威克拉格勒市青少年辍学这一关键问题为例,展示了该工具通过自然语言创建并运行专属社会数字孪生体的交互能力。
The Hitchhikers Guide to Production-ready Trustworthy Foundation Model powered Software (FMware)
Abstract
arXiv:2505.10640v1 Announce Type: cross Abstract: Foundation Models (FMs) such as Large Language Models (LLMs) are reshaping the software industry by enabling FMware, systems that integrate these FMs as core components. In this KDD 2025 tutorial, we present a comprehensive exploration of FMware that combines a curated catalogue of challenges with real-world production concerns. We first discuss the state of research and practice in building FMware. We further examine the difficulties in selecting suitable models, aligning high-quality domain-specific data, engineering robust prompts, and orchestrating autonomous agents. We then address the complex journey from impressive demos to production-ready systems by outlining issues in system testing, optimization, deployment, and integration with legacy software. Drawing on our industrial experience and recent research in the area, we provide actionable insights and a technology roadmap for overcoming these challenges. Attendees will gain practical strategies to enable the creation of trustworthy FMware in the evolving technology landscape.
摘要
以大型语言模型(LLMs)为代表的基础模型(FMs)正在通过催生FMware(以这些FMs为核心组件的系统)重塑软件产业。在本届KDD 2025教程中,我们系统性地探讨了FMware,将精选的研究挑战目录与实际生产问题相结合。首先剖析了构建FMware的研究现状与实践经验,重点探讨了模型选型、领域专用高质量数据对齐、提示词工程优化以及自主智能体编排等技术难点。随后通过梳理系统测试、性能优化、部署实施及与传统软件集成等环节的关键问题,阐述了从演示原型到生产级系统的复杂演进路径。基于我们在该领域的工业实践与最新研究成果,提供了可操作的实施建议与技术路线图,助力应对这些挑战。参会者将获得在快速演进的技术环境中构建可信FMware的实用策略。
Automating Security Audit Using Large Language Model based Agent: An Exploration Experiment
Abstract
arXiv:2505.10732v1 Announce Type: cross Abstract: In the current rapidly changing digital environment, businesses are under constant pressure to ensure that their systems are secure. Security audits help to maintain a strong security posture by ensuring that policies are in place, controls are implemented, and gaps are identified so that cybersecurity risks can be mitigated. However, audits are usually manual, requiring considerable time and cost. This paper looks at the possibility of developing a framework that leverages Large Language Models (LLMs) as an autonomous agent to execute part of the security audit, namely the field audit of password policy compliance for the Windows operating system. In an exploratory experiment using GPT-4 with LangChain, the agent executed the audit tasks by accurately flagging password policy violations and appeared to be more efficient than traditional manual audits. Despite its potential limitations in operational consistency in complex and dynamic environments, the framework suggests possibilities for extension to real-time threat monitoring and compliance checks.
摘要
在当前快速变化的数字环境中,企业持续面临确保系统安全性的压力。安全审计通过确保策略落实、控制措施实施以及识别网络安全风险缓解缺口,有助于维持强大的安全态势。然而,审计通常依赖人工操作,需要耗费大量时间和成本。本文探讨了开发一种框架的可能性,利用大语言模型(LLMs)作为自主代理来执行部分安全审计工作,特别是针对Windows操作系统的密码策略合规性现场审计。通过采用GPT-4与Langchain进行探索性实验,该代理能够准确标记密码策略违规行为,执行审计任务时表现出比传统人工审计更高的效率。尽管在复杂动态环境中可能存在操作一致性的潜在限制,但该框架为扩展至实时威胁监测与合规性检查提供了可能性。
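The compliance check at the heart of the audit entry above, comparing a machine's password policy settings against a required baseline and flagging violations, can be sketched as a deterministic function of the kind an LLM agent might call as a tool. Everything specific here is hypothetical: the setting names, the baseline thresholds, and the assumption that the settings have already been collected into a dictionary. None of it comes from the paper.

```python
import operator

# Hypothetical baseline: setting name -> (comparison, required value).
BASELINE = {
    "min_password_length": (">=", 12),
    "max_password_age_days": ("<=", 90),
    "password_history_size": (">=", 24),
    "complexity_enabled": ("==", True),
}

COMPARATORS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}

def audit_password_policy(settings):
    """Return one human-readable violation per setting that misses the baseline."""
    violations = []
    for name, (op, required) in BASELINE.items():
        actual = settings.get(name)
        # A missing setting counts as a violation rather than a pass.
        if actual is None or not COMPARATORS[op](actual, required):
            violations.append(f"{name}: found {actual!r}, expected {op} {required!r}")
    return violations

# Hypothetical settings collected from one machine.
observed = {
    "min_password_length": 8,
    "max_password_age_days": 60,
    "password_history_size": 24,
    "complexity_enabled": True,
}

report = audit_password_policy(observed)
```

Keeping the pass/fail rule in plain code and leaving only collection and reporting to the agent is one way to reduce the operational-consistency concern the abstract raises.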
AI-enhanced semantic feature norms for 786 concepts
Abstract
arXiv:2505.10718v1 Announce Type: cross Abstract: Semantic feature norms have been foundational in the study of human conceptual knowledge, yet traditional methods face trade-offs between concept/feature coverage and verifiability of quality due to the labor-intensive nature of norming studies. Here, we introduce a novel approach that augments a dataset of human-generated feature norms with responses from large language models (LLMs) while verifying the quality of norms against reliable human judgments. We find that our AI-enhanced feature norm dataset, NOVA: Norms Optimized Via AI, shows much higher feature density and overlap among concepts while outperforming a comparable human-only norm dataset and word-embedding models in predicting people's semantic similarity judgments. Taken together, we demonstrate that human conceptual knowledge is richer than captured in previous norm datasets and show that, with proper validation, LLMs can serve as powerful tools for cognitive science research.
摘要
语义特征规范作为人类概念知识研究的基石,传统方法因规范研究需耗费大量人力,始终面临概念/特征覆盖度与质量可验证性之间的权衡。本研究提出一种创新方法,通过将大语言模型(LLMs)生成的特征响应与人工规范数据集相结合,并依据可靠的人类判断验证规范质量。研究发现,经AI增强的特征规范数据集NOVA(通过人工智能优化的规范)在概念间特征密度和重叠度上显著提升,同时在预测人类语义相似性判断任务中,其表现优于纯人工规范数据集及词嵌入模型。综合结果表明,人类概念知识比现有规范数据集所捕获的内容更为丰富,且经过适当验证后,大语言模型可成为认知科学研究的强有力工具。